{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating Detectors\n", "\n", "In `scikit-clean`, a `Detector` only identifies the mislabelled samples. It is not a complete classifier, only a part of one, so detectors are evaluated differently from classifiers.\n", "\n", "We can view a noise detector as a binary classifier: its job is to provide a probability indicating whether a sample is \"mislabelled\" or \"clean\". We can therefore use binary classification metrics that work on continuous output: Brier score, log loss, area under the ROC curve, etc." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Suppress warnings; remove this before modifying this notebook\n", "def warn(*args, **kwargs):\n", "    pass\n", "import warnings\n", "warnings.warn = warn\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.datasets import make_classification\n", "from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score\n", "\n", "from skclean.tests.common_stuff import NOISE_DETECTORS # All noise detectors in skclean\n", "from skclean.utils import load_data\n", "from skclean.detectors.base import BaseDetector\n", "from skclean.simulate_noise import flip_labels_uniform" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Baseline detector that returns uniformly random scores\n", "class DummyDetector(BaseDetector):\n", "    def detect(self, X, y):\n", "        return np.random.uniform(size=y.shape)\n", "\n", "from skclean.detectors import KDN, RkDN\n", "# Equal-weight average of the KDN and RkDN scores\n", "class WkDN:\n", "    def detect(self,X,y):\n", "        return .5 * KDN().detect(X,y) + .5 * RkDN().detect(X,y)\n", "\n", "ALL_DETECTOTS = [DummyDetector(), WkDN()] + NOISE_DETECTORS" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X, y = make_classification(1800, 10)\n", "#X, y = load_data('breast_cancer')\n", "\n", "yn = flip_labels_uniform(y, .3) # 30% label noise\n", "clean_idx = (y==yn) # Indices of correctly labelled 
samples" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | log | brier | roc |
|---|---|---|---|
| DummyDetector | 0.999 | 0.333 | 0.501 |
| WkDN | 0.664 | 0.183 | 0.811 |
| ForestKDN | 1.099 | 0.131 | 0.858 |
| InstanceHardness | 0.448 | 0.141 | 0.902 |
| KDN | 0.830 | 0.173 | 0.818 |
| RkDN | 3.371 | 0.227 | 0.749 |
| MCS | 0.294 | 0.071 | 0.955 |
| PartitioningDetector | 0.942 | 0.072 | 0.950 |
| RandomForestDetector | 0.464 | 0.145 | 0.908 |